ABSTRACT


Talent management involves many managerial decisions aimed at placing the right
people with the right skills at the appropriate location and time. The authors
report a machine learning solution for Human Resource (HR) attrition analysis
and forecasting. The data for this investigation is retrieved from Kaggle, a Data
Science and Machine Learning platform [1]. The present study estimates the
performance of various classification algorithms and compares their classification
accuracy. Model performance is evaluated in terms of the Error Matrix and the
Pseudo R-Square. The results show that the Random Forest model can be used
effectively for classification. The analysis concludes that employee attrition
depends more on employees' satisfaction level than on other attributes.


INTRODUCTION 

Identifying the existing talent in an organization is among the top talent
management challenges. For every organization, human resources play a vital
role in all strategic decisions. Satisfied, highly motivated and loyal employees
form the basis of a company, which in turn affects the productivity of the
organization.

The prime objective of the present study is to analyze why some of the best and
most experienced employees are leaving prematurely. The analysis also aims
to predict which valuable employees will leave next.

The rest of the paper is organized as follows. The Introduction is followed by the
materials and methods used in the present study. The third section summarizes the
results and discussion of the HR attrition analysis. The conclusion at the end
justifies the suitability of the Random Forest model for this talent mining.


Materials and Methods 

The dataset for the present analysis is taken from Kaggle, a Machine Learning platform [1]. It is a simulated dataset
comprising 15000 employee records classified into two categories (left or not left) based on satisfaction level, latest
evaluation, number of projects worked on, average monthly hours, time spent in the company, work accident, promotion within
the past 5 years, department and salary. Table 1 describes the employee dataset.


Table 1: Employee dataset description for talent mining

Attribute                Description                                                Data Type
satisfaction_level       Level of satisfaction (0-1)                                Numeric
last_evaluation          Time since last performance evaluation (in Years)          Numeric
number_project           Number of projects completed while at work                 Numeric
average_montly_hours     Average monthly hours at workplace                         Numeric
time_spend_company       Number of years spent in the company                       Numeric
Work_accident            Whether the employee had a workplace accident              Numeric
left                     Whether the employee left the workplace or not (1 or 0)    Numeric
promotion_last_5years    Whether the employee was promoted in the last five years   Numeric
sales                    Department in which they work                              String
salary                   Relative level of salary (high)                            String

This section describes the experiment conducted for employee attrition analysis and forecasting. The present study is
carried out using R and the Rattle data mining platform [4]. Figure 1 shows a summary of the HR dataset. The dataset is
partitioned randomly into training, testing and validation sets in the proportions 70%, 15% and 15% respectively. We used the
training set to adjust the model parameters and the validation set to control the learning process.
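
The random 70/15/15 partition described above can be sketched as follows. This is a minimal Python illustration and not the Rattle procedure used in the study; the function name `partition_indices`, the seed value and the use of integer percentages are assumptions made for the example.

```python
import random

def partition_indices(n_rows, train_pct=70, validate_pct=15, seed=42):
    """Shuffle row indices and cut them into training, validation and testing parts.

    Integer percentages avoid floating-point rounding when sizing each part.
    The seed is an arbitrary (assumed) value that makes the split repeatable.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = n_rows * train_pct // 100
    n_validate = n_rows * validate_pct // 100
    train = indices[:n_train]
    validate = indices[n_train:n_train + n_validate]
    test = indices[n_train + n_validate:]  # whatever remains (about 15%)
    return train, validate, test

train, validate, test = partition_indices(15000)
print(len(train), len(validate), len(test))
```

Every row lands in exactly one partition, so the testing rows are never seen while the model is fitted.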
Figure 1: Dataset exploration - Summary


From the wide range of machine learning algorithms, the authors picked Decision Tree, Random Forest, Support Vector Machine
(SVM), and Linear Regression techniques to build the model. These algorithms are based on supervised learning and are well
known for building prediction models [8]. Supervised learning algorithms try to model the relationships and dependencies
between the target output and the input features (predictors), so that the output values for new data can be predicted
from the relationships learned from previous data.

Figure 2 summarizes the Decision Tree model of the HR data. The tree begins with a root node, "satisfaction level", which splits
into different branches leading to further nodes, each of which may split again or end as a leaf node. Each non-leaf node is
associated with a test or question that determines which branch to follow [7]. The leaf nodes indicate the attrition state,
i.e. whether the employee "left" or "not left". Figure 3 gives a pictorial representation of the resulting Decision Tree.
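
The way a record travels from the root node to a leaf can be sketched as below. This is only an illustration of the traversal: the tree structure and every threshold in `TOY_TREE` are hypothetical values invented for the example and are not the splits learned in the study.

```python
# Hypothetical toy tree. The features exist in the dataset, but the thresholds
# and the shape of the tree are invented for illustration only.
TOY_TREE = {
    "feature": "satisfaction_level", "threshold": 0.5,
    "below": {
        "feature": "number_project", "threshold": 3,
        "below": {"label": "left"},
        "at_or_above": {"label": "not left"},
    },
    "at_or_above": {
        "feature": "time_spend_company", "threshold": 5,
        "below": {"label": "not left"},
        "at_or_above": {"label": "left"},
    },
}

def classify(record, node=TOY_TREE):
    """Answer the test at each non-leaf node until a leaf label is reached."""
    while "label" not in node:
        went_below = record[node["feature"]] < node["threshold"]
        node = node["below"] if went_below else node["at_or_above"]
    return node["label"]

employee = {"satisfaction_level": 0.2, "number_project": 2, "time_spend_company": 3}
print(classify(employee))
```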
Classification tree:

rpart(formula = left ~ ., data = crs$dataset[crs$train, c(crs$input,
    crs$target)], method = "class", parms = list(split = "information"),
    control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

Variables actually used in tree construction:

[1] average_montly_hours last_evaluation      number_project

[4] satisfaction_level   time_spend_company

Root node error: 2485/10499 = 0.23669

n= 10499



   CP         nsplit   rel error   xerror    xstd
1  0.240644   0        1.00000     1.00000   0.0175262
2  0.134808   1        0.75936     ...       0.0153321
3  ...        3        ...         ...       0.0113231
4  0.054728   5        ...         ...       0.0035733
5  0.031388   6        0.18712     0.18833   0.0085093
6  0.016901   7        0.15573     0.15976   0.0078650
7  0.010060   8        0.13883     0.14165   0.0074223
8  0.010000   9        0.12877     0.13119   0.0071521


Time taken: 0.39 secs

Figure 2: Decision tree modeling

Figure 3: Decision tree for HR attrition status


Figure 4 summarizes the Random Forest (RF) model for HR attrition analysis. The randomForest package in the R environment is
employed here to build the model [5-6]. RF builds many decision trees, each using a random subset of the data and variables.
Rattle provides access to three tuning parameters: the number of trees, the sample size and the number of variables.
Number of observations used to build the model: 10499
Missing value imputation is active.

Call:

randomForest(formula = as.factor(left) ~ .,

data = crs$dataset[crs$sample, c(crs$input, crs$target)],

ntree = 200, mtry = 3, importance = TRUE, replace = FALSE, na.action = randomForest::na.roughfix)

Type of random forest: classification
Number of trees: 200
No. of variables tried at each split: 3

OOB estimate of error rate: 0.9%

Confusion matrix:

     0     1     class.error

0    8002  12    0.00149738

1    82    2403  0.03299799

Figure 4: Summary of the Random Forest Model
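
The per-class errors and the out-of-bag error rate in Figure 4 follow directly from the confusion matrix. The short Python sketch below redoes that arithmetic. It assumes the class-0 row reads 8002 and 12 and the class-1 row reads 82 and 2403, which is the reading consistent with the printed class errors and with the 10499 observations used to build the model. The function names are illustrative.

```python
# Rows are actual classes (0 = stayed, 1 = left); columns are predicted classes.
# The counts are the assumed reading of Figure 4 described in the lead-in.
OOB_MATRIX = [[8002, 12], [82, 2403]]

def class_errors(matrix):
    """Fraction of each actual class that was misclassified."""
    errors = []
    for actual, row in enumerate(matrix):
        wrong = sum(count for predicted, count in enumerate(row) if predicted != actual)
        errors.append(wrong / sum(row))
    return errors

def overall_error(matrix):
    """Fraction of all observations that were misclassified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

print([round(e, 8) for e in class_errors(OOB_MATRIX)])
print(round(100 * overall_error(OOB_MATRIX), 1), "%")
```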

Figure 5 summarizes the Support Vector Machine (SVM) built for the attrition analysis of the employee data. SVM searches for
the support vectors that define the boundary separating the classes.


Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
parameter : cost C = 1

Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.109166570184432

Number of Support Vectors : 1510

Objective Function Value : -1188.127
Training error : 0.03467
Probability model included.

Time taken: 10.30 secs


Figure 5: Summary of SVM Model 
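
The Gaussian Radial Basis kernel named in Figure 5 measures how similar two records are. A minimal sketch is given below, assuming the parameterization k(x, y) = exp(-sigma * ||x - y||^2) with the sigma value reported in the figure. The function name `rbf_kernel` and the sample vectors are illustrative, and the inputs are assumed to be already scaled numeric features.

```python
import math

SIGMA = 0.109166570184432  # hyperparameter value reported in Figure 5

def rbf_kernel(x, y, sigma=SIGMA):
    """Gaussian RBF kernel: exp(-sigma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)

# Identical records give a similarity of 1; the value decays as records move apart.
print(rbf_kernel([0.4, 0.5], [0.4, 0.5]))
print(rbf_kernel([0.4, 0.5], [2.0, 3.0]))
```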


Figure 6 summarizes the Linear Regression model, the traditional method for fitting a statistical model to data. It is
applicable here because the target variable "attrition status" is coded numerically (1 or 0).



Results and Discussion 

The present investigation employed different prediction algorithms to analyze employee attrition status and the likelihood of
retention or attrition of employees. The performance of the models is evaluated in terms of the Error Matrix and the Pseudo
R-Square. An error matrix, also known as a confusion matrix, shows the true outcomes against the predicted outcomes.
Table 2 gives the performance of these classifiers in terms of the error matrix.
Model                     Error Matrix

Decision Tree                        Predicted
                          Actual     0       1       Error (%)
                          0          1662    21      1.2
                          1          41      525     7.2

Random Forest                        Predicted
                          Actual     0       1       Error (%)
                          0          1680    3       0.2
                          1          15      551     2.7

Support Vector Machine               Predicted
                          Actual     0       1       Error (%)
                          0          1641    42      2.5
                          1          50      516     8.8

Linear Model                         Predicted
                          Actual     0       1       Error (%)
                          0          1558    125     7.4
                          1          396     170     70.0
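
The error matrices can also be reduced to a single overall accuracy per model, which makes the comparison easier to read. The sketch below is an illustration only: the counts are read so that the actual-class rows sum to 1683 and 566 for every model, which is the reading consistent with the printed error percentages, and overall accuracy is a derived figure that the table itself does not report.

```python
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
# Counts follow the assumed reading of the error matrices described above.
ERROR_MATRICES = {
    "Decision Tree": [[1662, 21], [41, 525]],
    "Random Forest": [[1680, 3], [15, 551]],
    "Support Vector Machine": [[1641, 42], [50, 516]],
    "Linear Model": [[1558, 125], [396, 170]],
}

def accuracy(matrix):
    """Share of test records whose predicted class equals the actual class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Rank the models from most to least accurate on the test partition.
ranked = sorted(ERROR_MATRICES, key=lambda name: accuracy(ERROR_MATRICES[name]), reverse=True)
for name in ranked:
    print(f"{name}: {100 * accuracy(ERROR_MATRICES[name]):.1f}%")
```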


Figure 7, the "Predicted versus Observed" plot, shows the performance of all four models by plotting the predicted values
against the observed values. The Pseudo R-Square is the square of the correlation between the predicted and observed
values; the closer it is to 1, the better the model. Table 3 gives the Pseudo R-Square values for these four models.
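
As a minimal sketch of this definition, the Pseudo R-Square can be computed as the squared Pearson correlation between observed and predicted values. The function name and the sample values below are illustrative assumptions and are not data from the study.

```python
def pseudo_r_squared(observed, predicted):
    """Square of the Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    covariance = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_obs == 0 or var_pred == 0:
        raise ValueError("correlation is undefined when either series is constant")
    return covariance ** 2 / (var_obs * var_pred)

# Hypothetical observed attrition flags and model scores, for illustration only.
observed = [0, 0, 1, 1, 0, 1]
predicted = [0.1, 0.3, 0.8, 0.7, 0.2, 0.9]
print(pseudo_r_squared(observed, predicted))
```

A model whose predictions track the observed values closely yields a value near 1, while predictions unrelated to the observed values yield a value near 0.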
Figure 7: "Predicted versus Observed" plot for classifiers
Table 3: Performance accuracy of classifiers

Classifier                Pseudo R-square
Decision Tree             0.8473
Random Forest             0.9773
Support Vector Machine    0.8315
Linear Regression         0.2299


The confusion matrices and the "Predicted versus Observed" plot indicate that, for the underlying data, Random Forest is the
most appropriate model for the analysis of employee attrition among the algorithms considered in this study. Figure 8 shows
the relative importance of the HR dataset attributes using the Gini importance and Permutation importance measures. Both
measures indicate that employees' "satisfaction level" is the predominant predictor of employee attrition.


[Variable Importance plot with two panels, MeanDecreaseAccuracy and MeanDecreaseGini; horizontal axis: Relative Importance.
Both panels list the attributes in the order satisfaction_level, time_spend_company, number_project, average_montly_hours,
last_evaluation, sales, salary, Work_accident, promotion_last_5years.]


Figure 8: Dependency of employee attrition status on other attributes 
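
The idea behind the permutation importance measure can be sketched as follows: shuffle one attribute, re-score the model, and record how much the accuracy drops. This is a simplified illustration and not the out-of-bag procedure of the randomForest package. The rule-based `toy_predict` model, the `salary_rank` column and the sample rows are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, seed=0):
    """Drop in accuracy after randomly shuffling one feature across the rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [row[feature] for row in rows]
    random.Random(seed).shuffle(shuffled)
    permuted_rows = [dict(row, **{feature: value}) for row, value in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted_rows, labels)

# Hypothetical stand-in model: predicts attrition from satisfaction alone.
def toy_predict(row):
    return 1 if row["satisfaction_level"] < 0.5 else 0

# Hypothetical rows; labels are generated by the toy rule itself.
rows = [{"satisfaction_level": s / 10, "salary_rank": s % 3} for s in range(10)]
labels = [toy_predict(row) for row in rows]

for feature in ("satisfaction_level", "salary_rank"):
    print(feature, permutation_importance(toy_predict, rows, labels, feature))
```

An attribute the model never consults, such as the hypothetical `salary_rank` here, shows no drop at all, whereas shuffling an attribute the model relies on can only lower the accuracy.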


Conclusion 

The authors have explored a machine learning solution for HR
attrition analysis and forecasting. The present study estimates
the performance of various classification algorithms and
compares their classification accuracy. Model performance
is evaluated in terms of the Error Matrix and the Pseudo
R-Square. The results show that the Random Forest model
can be used effectively for classification. The results also
indicate that employee attrition depends more on employees'
satisfaction level than on other attributes.